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Abstract 


There is a long history of devleopment of methodology dealing with missing data in statistical 
analysis. Today, the most popular methods fall into two classes, Complete Cases (CC) and Multiple 
Imputation (MI). Another approach. Available Cases (AC), has occasionally been mentioned in the 
research literature, in the context of linear regression analysis, but has generally been ignored. In 
this paper, we revisit the AC method, showing that it can perform better than CC and MI, and we 
extend its breadth of application. 
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1, Introduction 


For concreteness in this introduction, consider a classic linear regression analysis, based on 
a data matrix D = {Dij) of n rows and p + 1 columns, with the first p columns containing 
the values of the predictor variables and the last column consisting of values of the response 
variableQ. Some of the elements of the matrix may be missing, a condition that is in the R 
language denoted by NA. 

A wide variety of methods have been developed to deal with the missing values. The most 
popular fall into one of two categories, again described in our regression analysis context 
for convenience: 

• Complete cases (CC)JlHere one deletes any row in the data matrix that has at least 
one NA value. 

• Multiple Imputation (MI): These methods involve estimating the conditional dis¬ 
tribution of a variable from the others, and then sampling from that distribution via 
simulation. Multiple alternate versions of the data matrix are generated, with the NA 
values replaced by values that might have been the missing one. 

Here we are interested in a third approach: 

• Available Cases (AC)|llf the statistical method involves computation involving, say, 
various pairs of varaibles, include in such a calculation any observation for which this 
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pair is intact, regardless of whether the other variables are intact. The same holds for 
triples of variables and so on. 


For example, as will be detailed below, linear regression analysis only involves computation 
of certain pairwise-intact values of the form 

1 

-y^DirDis ( 1 ) 

Thus we may compute ([Til for all rows i for which both dir and diS are intact — even if 
some other dik are missing. In ([Til the factor 1/n would be changed to 1/Nrs, where Nrs 
is the number of rows with intact (r, s) pairs in the matrix, as in for example (Cohen and 
Cohen, 1983). In other words, ([Hi becomes 

1 "■ 

d^rs — "TT ^ ' IrsDirDig (2) 

where Irs is 1 or 0, depending on whether Dir and Dig are intact. 

(As noted, there are important assumptions underlying these methods, but we defer discus¬ 
sion on this to Section 0) 

Though AC was considered in the early literature on missing data, over the years, MI 
methods became more and more sophisticated, and they enjoy high popularity today. In 
R, for instance, there are packages Amelia (Honaker et al, 2011), mi (Su et al, 2011) and 
mice (van Buuren, 2011) that apply MI techniques. See (Little et al, 2002) for very detailed 
coverage, or http .-//sites. stat.psu.edu/^jls/mifaq.html for an overview. 

Concurrently, interest in AC waned, not only due to its stringent assumptions but also out 
of a concern that the cross products matrix whose elements are given by ([H may not be 
positive definite. 

We believe that AC can be a very useful tool. As Marsh notes (Marsh, 1998), AC “follows 
naturally from a desire to use as much of the data as possible.” We will show here that AC 
can indeed yield significant improvements in statistical accuracy over CC, while avoiding 
the very slow computational speed of MI. In addition to investigating the standard AC 
application of linear regression nodeling, we also investigate principal components analysis 
(PCA) and analysis of contingency tables. We make software available to implement these 
methods. 


2. Choice of MI Method 

We chose Amelia as our representative MI method, arbitrarily using the criterion that it has 
the most citations on Google Scholar. Under the assumption that the population distribution 
of the rows of D is multivariate normal, an outline of its approach is as follows. 


Starting the with original data Dij, m perturbations of this data Dij^ are created, 
k = 1,..., m, through bootstrap sampling. Note that these new data sets do contain 
NA values. 



• For /c = 1, m, do: 

- Replace the NAs by Os. 

- Use the EM algorithm and the assumption of multivariate normality to estimate 
the population mean vector and covariance matrix from this data. 

- Replace each NA value by an imputed one, consisting of a value drawn at 
random from the estimated conditional distribution of this variable, given the 
intact values of the other variables. 

• Combine the m data sets, say by averaging the m values of a quantity of interest, 
such as a regression coefficient. 

In our initial empirical investigation, we quickly found that Amelia was not performing 
well: 


• Its statistical accuracy was no better than those of CC and AC. 

• It was slow. For instance, in a PCA simulation with n = 10000 and p = 25, CC and 
AC took 0.011 and 1.967 seconds, respectively, while MI took 92.928 seconds. 

For this reason, we will present empirical results here only for the CC and AC methods. It 
is crucial to keep in mind, though, that CC and AC require more stringent assumptions than 
MI. Thus later in this paper we will return to MI in general, and Amelia in particular. 

3. AC in Linear Regression Models 

As noted, in the literature, AC has mostly been considered in the context of linear regres- 
sion0 Thus we will begin there. 


3.1 Motivation and Method 

Consider the case of random-X regression. Define fhe mafrix U and fhe vecfor U fo be 79 
minus fhe lasf column, and fhe lasf column of 79, respectively. Then fhe classic formula for 
fhe vecfor of esfimafed regression coefficienfs, assuming infacf dafa, is 

{U'U)-\U'V) = Q^U'U 
which as re — oo converges fo 


1 , 

-U'V 

re 


(3) 


[E{XX')]-^E{XY) (4) 

where fhe random column vecfor X and and fhe random scalar variable Y have fhe popu¬ 
lation disfribufion from which fhe rows of U and elemenfs of V are sampled. 

The poinf is fhaf fhis convergence sfill holds if in (|3]l, we replace fhe (r, s) elemenf of U'U 
in (l3]l by Q, r, s = 1, ...,p and replace elemenf r in U'V by dUl with s = p + 1. 

"^Actually, to our knownledge, in the litetature to date, AC has only been applied to covariance-related 
methods, including linear regression. 



3.2 Implementation 


R code for use of AC as a replacement for lm() is available in two implementations (not just 
two locations), a function lmmv() at https://github.com/maxguxiao/Available-Cases, and a 
function lmac() in the regtools package at https://github.com/matlojf/regtools. 

The latter takes advantage of the fact that R’s cov() function offers an argument option 
use=pairwise.complete.obs, which applies AC to finding covariance matrices, which in 
turn can be used to estimate regression coefficients:: 

# arguments: 

# x: predictor values (no Is column) 

# y: response variable values 

Imac <- function (x,y) { 
p <- ncol(x) 

tmp <- cov(cbind(x,y),use='pairwise.complete.obs') 

upu <- tmp[l:p,l:p] 

upv <- tmp[l:p,p+l] 

bhat <- solve(upu,upv) 

bhatO <- 

mean(y,na.rm=TRUE) - colMeans(x,na.rm=TRUE) %*% bhat 
c(bhatO,bhat) 

} 

This works because for centered data, dUl is equal to 


Cov{X)-^Cov{X,Y) (5) 

Since the use=pairwise.complete.obs option in R’s cov() uses the AC method, this gives 
us AC estimation for linear regression. 

The above code, lmac(), is much faster than lmmv(), since R’s cov() function operates at 
C-level, as opposed to the use of R for() loops in lmmv(). 


3.3 Standard Errors for the Coefficients 

Our code computes standard errors for the estimated regression coefficients in two different 
ways. 

lmmv(): 

The lmmv() function uses the delta method, together with numerical calculation of deriva¬ 
tives using the numDeriv package. Any component of (O is a function of the K^s in © 
forl<r<s<p+l. The function genD() in numDeriv is then used to compute the 
numerical gradient G of this function. 

The standard error is then 


VG'BG 


( 6 ) 



NA rate 

CC var. 

AC var. 

0.01 

0.008034006 

0.002094305 

0.05 

0.05018815 

0.01230746 

0.10 

0.1421812 

0.02398466 


Table 1: Pima Data, Linear Regression 

where B is the estimated covariance matrix for the Krs (conditional on the Nrs)- We have 


Covi^Kabi Kcd) 




(V) 


There are various ways to evaluate this, such as calling Cov() with the pairwise.complete.obs 
option. 

lmac(): 

In lmac(), we simply use the bootstrap to generate standard errors. Though this may seem 
more time-consuming than using the delta method, this consideration is countered by the 
fact that genD() is written in R rather than C, and thus involves slow loop computation. 


3.4 Empirical Evaluation 

We present here simulations that run on real or simulated data. The idea is that, starting 
with a given data set, in each repetition of the simulation, random NA values are inserted, 
and the value of j5i is recorded. The variance of such values for lmac() is compared to that 
for lm(); the latter represents CC, as its method of handling NAs is cell 

We tried it for several real data sets. One is the Pima study at the UCI Machine Learn¬ 
ing Repository, https://archive.ics.uci.edu/ml/datasets/Pima+lndians+Diabetes. Here n = 
768 and p = 8. We took blood pressure as our response variable, and all the other variables 
as predictors. Results for inserting 1%, 5% and 10% NAs were as shown in Table[T] 

AC was much more accurate, especially with the heavier NA rate. 

We also tried the method on some Census data, concerning programmers and engineers in 
Silicon Valley. (This data set is available in the regtools package.) Here, n = 20090 and 
p = 11. The results are shown in Table |2] 

Next, we considered the baseball player data set in CRAN’s freqparcoord package (Mat- 
loff and Xie, 2014), which consists of data on height, weight, age and position for 1015 
major league playersH In predicting weight from only height and age, there appeared to 
be no real difference in the accuracy of CC and AC; see Table [3] However, when playing 
position was added to the prediction, with dummy variables for infielders, outfielders and 
pitchers 0 AC greatly outperformed CC, as seen in Table HI 

^The means values for the two methods were virtually identical. 

^Data courtesy of the UCLA Statistics Department. 

’The remaining categories are catchers and, in the American League, designated hitters. 










NA rate 

CC var. 

AC var. 

0.01 

0.4694873 

0.1387395 

0.05 

2.998764 

0.7655222 

0.10 

8.821311 

1.530692 


Table 2: Census Data, Linear Regression 


NA rate 

CC var. 

AC var. 

0.01 

0.001587028 

0.001587711 

0.05 

0.009455012 

0.009799962 

0.10 

0.02019519 

0.01996154 


Table 3: Baseball Data I, Linear Regression 


NA rate 

CC var. 

AC var. 

0.01 

0.00354029 

0.00211546 

0.05 

0.02160327 

0.01146491 

0.10 

0.05171839 

0.02553519 


Table 4: Baseball Data 11, Linear Regression 












































NA rate 

CC var. 

AC var. 

sgm 

0.01 

5.381073e-06 

8.068427e-05 

1 

0.10 

0.0001015785 

0.0009777008 

1 

0.10 

0.002390376 

0.001291069 

5 


Table 5: Simulated Data, PC A 

In all cases, AC did quite well. However, we also compared CC and AC on data generated 
as 

n <- 2500 

p <- 2 

pi <- p + 1 

a <- 5 
b <- 8 

ones <- matrix{rep{1,p),ncol=l) 
z <- matrix(nrow = n, ncol = pi) 
z[,l:p] <- runif{n*p,min=a,max=b) 
z[,pl] <- 

z[,l:p] %*% ones + sgm * runif{n,min = -0.5,max = 0.5) 

As seen in Table |5] AC does eventually dominate, but only for the larger value of sgm, and 
AC does considerably worse than CC before that. 

4. AC in Principal Components Analysis 

Once one uses AC in the context of covariance matrices for linear regression analysis, it is 
natural to do so for PCA. The regtools version, pcac(), is quite simple: 

pcac <- function (indata, scale = FALSE) 

{ 

covcor <- if (scale) 
cor 

else cov 

cvr <- covcor{indata, use = "pairwise.complete.obs") 

tmp <- eigen(cvr) 

res <- list {) 

if (any{tmp$values < 0)) 

stop ("at least one negative eigenvalue") 
res$sdev <- sqrt{tmp$values) 
res$rotation <- tmp$vectors 
res 

} 

The quantity of interest was the square root of the maximal eigenvalue. AC was much more 
effective than CC on the Pima (Table [^, Census (Table |71l and baseball (Table [Hi data. 









NA rate 

CC var. 

AC var. 

0.01 

3.860661 

0.3721266 

0.05 

23.8738 

1.976418 

0.10 

64.26592 

4.95431 


Table 6: Pima Data, PCA 


NA rate 

CC var. 

AC var. 

0.01 

32403.34 

4498.546 

0.05 

147780.2 

20018.99 

0.10 

562266.5 

64522.77 


Table 7: Census Data, PCA 


NA rate 

CC var. 

AC var. 

0.01 

0.01391677 

0.002439572 

0.05 

0.07892307 

0.01110466 

0.10 

0.2025108 

0.02432591 


Table 8: Baseball Data, PCA 


NA rate 

CC var. 

AC var. 

0.01 

0.0001565521 

1.412733e-05 

0.05 

0.001146238 

7.169807e-05 

0.10 

0.004132952 

0.0001567328 


Table 9: Simulated Data, PCA 


























































For the simulated data z, as in Section 13.41 the results were a little different, with AC still 
doing very well, but with a caveat. AC performed well, as seen in Table|9] But it sometimes 
failed, due to negative eigenvalues. Of 100 trials for each of the NA rates of 0.01, 0.05 and 
0.10, there were 0, 7 and 13 instances of negative eigenvalues. This is related to the concern 
about possible lack of positive definiteness mentioned earlier. 

It must be noted that this distribution is highly artificial. The square roof of fhe populafion 
maximal eigenvalue is abouf 2.88, quife small in comparison fo fhe populafion mean of 
abouf 65 for fhe lasf variable in fhe dafa. Neverfheless, fhe above resulfs should be kepi in 
mind. 


5. AC in the Log-Linear Model 

To our knowledge, fhis is fhe firsl affempf fo use AC oulside of fhe realm of eslimafion of 
covariance mafrices@ 


5.1 Motivation and Method 

We will illustrate the method here in the 3-factor setting, using the formulations of (Chris¬ 
tensen, 1998, Chapter 3). Call the factorx X, Y and Z. 

As our example computation, take the model in which X and V are conditionally indepen¬ 
dent, given Z. Then the probability of an individual falling into cell ijk is 


Pijk = = j,Z = k) (8) 

= P{Z = k) P{X = i,Y = j\Z = k) (9) 

= P{Z = k) P{X = i\Z = k) P{Y = j\ Z = k) (10) 

_ Pi.k P.jk 
P..k 


This is a perfect opportunity for AC. For instance, we estimate pj.fc as 


1 A 

Pi.k = 2^ lXm=i,Zm = k (12) 


where is the number of data points in which X and Z are intact. 


5.2 Implementation 

This is all implemented in the function loglinac() in regtoolsj^ It works as follows. 

First, AC is used to estimate all the model quantities, e.g. pi ^ above. These are all multipled 
by the total number of observations, yielding estimated expected cell frequencies. The latter 

*By contrast, there is an extenisve literature on MI methods for generalized linear models, including the 
log-linear model. See (Ibrahim et al, 2005). Note by the way that Amelia is not appropriate for this setting, 
due to its assumptuon of multivariate normality for the data. 

®The log-linear model portion of regtools is just a prototype. At present, it handles only the 3-factor case, 
and does only point estimation. 





NA rate 

CC var. 

AC var. 

0.01 

4.395758e-05 

2.781903e-05 

0.05 

0.0002362016 

0.0001513719 

0.10 

0.0005367046 

0.000360953 


Table 10: UCB Admissions Data, Log-Linear Model 


are then treated as “observed cell counts,” and fed into R’s loglin() function. These produce 
the correct log-linear model coefficients 0 

Though loglinO takes its input data in the form of an R table, in order to use AC we need a 
data frame. The function tbltofakedf() creates a dataframe from a table for this purpose. 


5.3 Empirical Evaluation 

Here we used the UCBAdmissions table built-in to R. Continuing with the above example, 
the model used was that in which the factors Admitted and Gender were conditionally 
independent, given Department. The call is 


uca <- tbltofakedf(UCBAdmissions) 
loglinac {uca,list{c{l,3) ,c{2,3) ) ) 


As seen in Table [TOl once again AC can yield substantial improvements. 

6. Back to the MI Issue 

As mentioned, there has been concern about AC in two senses: positive definiteness of 
covariance matrices, and stringency of assumptions. Here we revisit both of these issues. 


6.1 Positive Definite Covariance Matrices 

One of the concerns that have arisen for the AC method in the past was possible lack of 
positive definiteness of U'U in (|3]l. Yet Marsh found that this is rarely a problem (Marsh, 
1998)0 

Furthermore, the problem can occur in Amelia as well. This is because the procedure re¬ 
places NA values in the bootstrapped versions of the original data by Os. Indeed, the Amelia 
code does include checks for this, halting the procedure upon detection of a problem. 

*°Of course, numbers such as the Pearson’s test that come out of this are not valid. 

** As noted earlier, though, we did occasionally encounter negative eigenvalues in the simulated data in our 
PCA studdy. 









6.2 Assumptions 


The issue of assumptions is more delicate, especially since, as is well recognized, the as¬ 
sumptions involved with CC, AC and MI are difficult to check using the data. 

Let Y denote a variable of interest, and let M be 1 or 0, depending on whether Y is missing. 
Also, let D denote the vector of the other varialbes, which for simplicity we assume are 
never missing. For the same reason, we also assume the variables are discrete-valued rather 
than continuous. 

CC and AC assume a Missing Completely at Random (MCAR) setting, which is usually 
defined as something likJ^ 


P{M = l\Y = s,D = t) = P{M = 1) (13) 

where t is in general vector-valued. Thus M is independent of {Y,D). Turning this around, 
we have 


P{Y = s,D = t\M = i) = P{Y = s,D = t) (14) 

for i = 0,1. In other words, the distribution of {Y, D) is the same, whether Y is missing or 
not, and thus inference made from the cases in which Y is observed generalize properly to 
the full distribution of {Y,D). 

MI assumes somewhat less, a condition known as Missing at Random (MAR). In our con¬ 
text here, this is defined as 


P{M = 1\Y = s,D = t) = P{M = 1\D = t) (15) 

A typical example of the idea behind MAR is given in (Cohen and Cohen, 1983), concern¬ 
ing a study of student motivation in a classroom survey. We might surmise that students 
who have low levels of motivation are less likely to answer the survey question, Y, con¬ 
cerning their level of motivation. But other factors D, such as socioeconomic status may 
explain Y so well that (fTSl) holds. The problem of course is that the predictive ability of D 
may not be strong enough to justify ([Tsl) . Moreover, in practice some of the values in the 
vector D will also be missing, further weakening the MAR assumption. 

Also, in the case of Amelia in particular, recall that in its EM computations, it replaces NA 
values by Os, possibly producing further bias. 

The literature on missing data often includes casual comments to the effect that use of CC 
in settings in which MCAR fails, but in which MAR holds, results in bias. Actually, this is 
not necessarily the case, as will be discussed in the next two sections. Though some careful 
treatments exist for the regression case, such as (Glynn and Laird, 1986), the analysis here 
will go into greater generality, i.e. will not be limited to expected values, and in any case is 
simple enough to include here. 

'^There is some variation in the literature on the details of the assumptions discussed here. 



6.2.1 Estimation of Conditional Quantities Under MAR 


Let’s see what happens under MAR in the case of regression analyses and other types of 
association analysis. 

Rewrite (ITSl) as 


P{Y = s\D = t,M = i)) 


P{Y = s,D = t,M = i) 

P{D = t,M = i) 

P(M = i\Y = s,D = t) P{Y = s,D = t) 

P{D = t,M = i) 

P{M = i\Y = s,D = t) P{Y = s\D = t) P{D = t) 


P{D = t,M = i) 

P{M = i\D = t) P{Y = s\D = f) P{D = t) 
P{D = t,M = i) 

P{Y = s\D = t) 


(16) 

(17) 


where the next-to-last equality comes from ([Tsl) . 

In other words, if we are interested in the relation between Y and D, say by performing 
regression analysis of y on 77 — i.e. modeling the conditional distribution of Y given D 
— our being deprived of the missing values of Y will not bias our regression analysis. 

In fact, (fTTI) has the rather ironic implication: 


The MAR assumption is meant to apply to situations in which CC and AC 
ostensibly cannot be used. Yet, if our goal is regression analysis or other types 
of measures of association, CC and AC can indeed be used in MAR settings 
after all. 


6.2.2 Estimation of Unconditional Quantities Under MAR 
On the other hand. 


P{Y = s\M = 0) 


P{Y = s,M = 0) 
P{M = 0) 

P{M = o|y = ^) 

P{M = 0) ^ 


s) 


(18) 

(19) 


In other words, our estimate of P(y = s), an unconditional quantity, may be biased upward 
or downward. Take the student motivation example, for instance. For values of s coding 
high motivation, we surmise in ([T^ . 


p(M = o|y = s) 
P{M = 0) 


( 20 ) 


thus causing an upward bias in the intact data. 









7. Conclusions and Future Work 


This work has found the following: 


• Studies on various real data sets were presented here that showed that (under the 
MCAR assumption), AC ean greatly outperform CC. 

• Although MI is thought of as a method to use when AC’s MCAR assumption does 
not hold, under Mi’s MAR assumption, AC still produees statistieally eorreet results 
for regression analyses and other models of assoeiation. 

• Situations in whieh MAR holds but MCAR does not may be rather rare. 

• MI eomputation is extremely slow, and does not seem to be any better statistieally 
than AC. 

• Thus, for regression/assoeiation analysis, AC may aetually be a eompetitive alterna¬ 
tive to MI. 


Clearly, though, these eonelusions are tentative. Mueh more investigation needs to be done 
on MI, ineluding its statistieal effieieney relative to AC. 
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